79 research outputs found

    AsterixDB: A Scalable, Open Source BDMS

    Full text link
    AsterixDB is a new, full-function BDMS (Big Data Management System) with a feature set that distinguishes it from other platforms in today's open source Big Data ecosystem. Its features make it well-suited to applications like web data warehousing, social data storage and analysis, and other use cases related to Big Data. AsterixDB has a flexible NoSQL style data model; a query language that supports a wide range of queries; a scalable runtime; partitioned, LSM-based data storage and indexing (including B+-tree, R-tree, and text indexes); support for external as well as natively stored data; a rich set of built-in types; support for fuzzy, spatial, and temporal types and queries; a built-in notion of data feeds for ingestion of data; and transaction support akin to that of a NoSQL store. Development of AsterixDB began in 2009 and led to a mid-2013 initial open source release. This paper is the first complete description of the resulting open source AsterixDB system. Covered herein are the system's data model, its query language, and its software architecture. Also included are a summary of the current status of the project and a first glimpse into how AsterixDB performs when compared to alternative technologies, including a parallel relational DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data analytics platform, for things that both technologies can do. Also included is a brief description of some initial trials that the system has undergone and the lessons learned (and plans laid) based on those early "customer" engagements

    Hobbes: optimized gram-based methods for efficient read alignment

    Get PDF
    Recent advances in sequencing technology have enabled the rapid generation of billions of bases at relatively low cost. A crucial first step in many sequencing applications is to map those reads to a reference genome. However, when the reference genome is large, finding accurate mappings poses a significant computational challenge due to the sheer amount of reads, and because many reads map to the reference sequence approximately but not exactly. We introduce Hobbes, a new gram-based program for aligning short reads, supporting Hamming and edit distance. Hobbes implements two novel techniques, which yield substantial performance improvements: an optimized gram-selection procedure for reads, and a cache-efficient filter for pruning candidate mappings. We systematically tested the performance of Hobbes on both real and simulated data with read lengths varying from 35 to 100 bp, and compared its performance with several state-of-the-art read-mapping programs, including Bowtie, BWA, mrsFast and RazerS. Hobbes is faster than all other read mapping programs we have tested while maintaining high mapping quality. Hobbes is about five times faster than Bowtie and about 2–10 times faster than BWA, depending on read length and error rate, when asked to find all mapping locations of a read in the human genome within a given Hamming or edit distance, respectively. Hobbes supports the SAM output format and is publicly available at http://hobbes.ics.uci.edu

    A novel class of microRNA-recognition elements that function only within open reading frames.

    Get PDF
    MicroRNAs (miRNAs) are well known to target 3' untranslated regions (3' UTRs) in mRNAs, thereby silencing gene expression at the post-transcriptional level. Multiple reports have also indicated the ability of miRNAs to target protein-coding sequences (CDS); however, miRNAs have been generally believed to function through similar mechanisms regardless of the locations of their sites of action. Here, we report a class of miRNA-recognition elements (MREs) that function exclusively in CDS regions. Through functional and mechanistic characterization of these 'unusual' MREs, we demonstrate that CDS-targeted miRNAs require extensive base-pairing at the 3' side rather than the 5' seed; cause gene silencing in an Argonaute-dependent but GW182-independent manner; and repress translation by inducing transient ribosome stalling instead of mRNA destabilization. These findings reveal distinct mechanisms and functional consequences of miRNAs that target CDS versus the 3' UTR and suggest that CDS-targeted miRNAs may use a translational quality-control-related mechanism to regulate translation in mammalian cells

    Effects of Anacetrapib in Patients with Atherosclerotic Vascular Disease

    Get PDF
    BACKGROUND: Patients with atherosclerotic vascular disease remain at high risk for cardiovascular events despite effective statin-based treatment of low-density lipoprotein (LDL) cholesterol levels. The inhibition of cholesteryl ester transfer protein (CETP) by anacetrapib reduces LDL cholesterol levels and increases high-density lipoprotein (HDL) cholesterol levels. However, trials of other CETP inhibitors have shown neutral or adverse effects on cardiovascular outcomes. METHODS: We conducted a randomized, double-blind, placebo-controlled trial involving 30,449 adults with atherosclerotic vascular disease who were receiving intensive atorvastatin therapy and who had a mean LDL cholesterol level of 61 mg per deciliter (1.58 mmol per liter), a mean non-HDL cholesterol level of 92 mg per deciliter (2.38 mmol per liter), and a mean HDL cholesterol level of 40 mg per deciliter (1.03 mmol per liter). The patients were assigned to receive either 100 mg of anacetrapib once daily (15,225 patients) or matching placebo (15,224 patients). The primary outcome was the first major coronary event, a composite of coronary death, myocardial infarction, or coronary revascularization. RESULTS: During the median follow-up period of 4.1 years, the primary outcome occurred in significantly fewer patients in the anacetrapib group than in the placebo group (1640 of 15,225 patients [10.8%] vs. 1803 of 15,224 patients [11.8%]; rate ratio, 0.91; 95% confidence interval, 0.85 to 0.97; P=0.004). The relative difference in risk was similar across multiple prespecified subgroups. At the trial midpoint, the mean level of HDL cholesterol was higher by 43 mg per deciliter (1.12 mmol per liter) in the anacetrapib group than in the placebo group (a relative difference of 104%), and the mean level of non-HDL cholesterol was lower by 17 mg per deciliter (0.44 mmol per liter), a relative difference of -18%. There were no significant between-group differences in the risk of death, cancer, or other serious adverse events. CONCLUSIONS: Among patients with atherosclerotic vascular disease who were receiving intensive statin therapy, the use of anacetrapib resulted in a lower incidence of major coronary events than the use of placebo. (Funded by Merck and others; Current Controlled Trials number, ISRCTN48678192 ; ClinicalTrials.gov number, NCT01252953 ; and EudraCT number, 2010-023467-18 .)

    Indexed and Distributed Processing of Similarity Selection and Join Queries

    No full text
    A similarity query is to find from a collection of items those that are similar to a given query item. Answering similarity queries is important in applications such as DNA sequence assembly, record linkage, and query relaxation, where errors could occur in queries as well as the data. The wide relevance of similarity queries presents to us application-specific constraints and desiderata. In this thesis, we develop and evaluate indexes and algorithms for answering such queries efficiently in the context of four different settings. First, we present a DNA-specific alignment method whose primary focus is on the speed of query execution. Our results show a performance improvement of 2-10X compared to existing state-of-the-art packages. Second, we develop a flexible compression technique for reducing the size of an inverted index to a given amount of space while retaining efficient query processing. In our experimental evaluation we could reduce the index size up to 60% without sacrificing query response times. Third, we study external-memory solutions well-suited for database management systems, where the data and indexes are stored on disk, and demonstrate a substantial benefit over alternative methods. Fourth, we discuss how we incorporated some of our solutions into the ASTERIX parallel database system to optimize similarity queries via secondary indexes. We elaborate on our design choices and query-processing capabilities, and conclude with experiments on large-scale data

    Space-constrained gram-based indexing for efficient approximate string search

    No full text
    Answering approximate queries on string collections is important in applications such as data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. Many existing algorithms use gram-based inverted-list indexing structures to answer approximate string queries. These indexing structures are “notoriously” large compared to the size of their original string collection. In this paper, we study how to reduce the size of such an indexing structure to a given amount of space, while retaining efficient query processing. We first study how to adopt existing inverted-list compression techniques to solve our problem. Then, we propose two novel approaches for achieving the goal: one is based on discarding gram lists, and one is based on combining correlated lists. They are both orthogonal to existing compression techniques, exploit a unique property of our setting, and offer new opportunities for improving query performance. For each approach we analyze its effect on query performance and develop algorithms for wisely choosing lists to discard or combine. Our extensive experiments on real data sets show that our approaches provide applications the flexibility in deciding the tradeoff between query performance and indexing size, and can outperform existing compression techniques. An interesting and surprising finding is that while we can reduce the index size significantly (up to 60 % reduction) with tolerable performance penalties, for 20-40 % reductions we can even improve query performance compared to original indexes
    corecore